experimental/ssh: show compute provisioning status during ssh connect startup #5576

Merged
TanishqDatabricks merged 4 commits into main from ssh-connect-gpu-startup-ux
Jun 23, 2026

@TanishqDatabricks TanishqDatabricks commented Jun 12, 2026

Collaborator

Changes

While the SSH server bootstrap job's compute spins up, the spinner now reads "Waiting for compute to start..." (all connection types) instead of "Starting SSH server...". For GPU accelerators, a persistent notice is printed upfront: "Waiting for GPU_8xH100 compute to be provisioned. This can take upwards of 10 minutes depending on capacity...".
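
As a rough illustration of the behavior described above (not the code in this PR), the message selection could look something like the sketch below. The function names `spinnerMessage` and `gpuNotice`, the state constant, and the idea of switching on a task state string are all assumptions made for the example; only the message texts come from this description.

```go
package main

import "fmt"

// statePending is a hypothetical name for the task state shown in the
// old spinner text ("task: PENDING"); the real code may model it differently.
const statePending = "PENDING"

// spinnerMessage returns the spinner text for the bootstrap task's state.
// While the task is PENDING, compute has not started yet, so the text says
// that instead of implying the SSH server is already starting.
func spinnerMessage(taskState string) string {
	if taskState == statePending {
		return "Waiting for compute to start..."
	}
	return "Starting SSH server..."
}

// gpuNotice returns the one-time upfront notice for a GPU accelerator,
// or an empty string when no accelerator was requested.
func gpuNotice(accelerator string) string {
	if accelerator == "" {
		return ""
	}
	return fmt.Sprintf(
		"Waiting for %s compute to be provisioned. This can take upwards of 10 minutes depending on capacity...",
		accelerator,
	)
}

func main() {
	// Print the persistent notice once, then show the spinner text.
	if notice := gpuNotice("GPU_8xH100"); notice != "" {
		fmt.Println(notice)
	}
	fmt.Println(spinnerMessage(statePending))
}
```

The notice is printed once before the wait begins, while the spinner text is recomputed on each state update, which is why the two are kept as separate functions in this sketch.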

Why

ssh connect --accelerator=GPU_8xH100 frequently fails with:

Error: failed to ensure that ssh server is running: failed to submit and start ssh server job: timed out: waiting for task to start (current state: PENDING)

GPU_8xH100 launch latency is ~10 minutes at P50 and ~30 minutes at P90, so sessions routinely hit the startup timeout even when nothing is wrong. Nothing in the output indicated that compute was being provisioned, so users read the error as a service outage.

Tests

  • go build, go vet, and go test ./experimental/ssh/... all pass; TestWaitForJobToStartSurfacesFailure updated for the waitForJobToStart signature change.
  • The change is display-only (spinner and notice text); no control flow or error behavior is modified.

This pull request and its description were written by Isaac.

… startup

GPU_8xH100 serverless capacity takes ~10 minutes at P50 and ~30 minutes at
P90 to acquire, but while waiting `ssh connect` only showed a generic
"Starting SSH server... (task: PENDING)" spinner, so users assumed a long
wait meant a service outage (see the Zillow report in
#remote-development-help).

Show "Waiting for compute to start..." while the bootstrap job's compute
spins up (all connection types, including dedicated-cluster auto-start),
and print an upfront notice for GPU accelerators that provisioning can
take upwards of 10 minutes.

The startup timeout increase for GPU accelerators is handled separately.

Co-authored-by: Isaac

eng-dev-ecosystem-bot commented Jun 12, 2026

Collaborator

Integration test report

Commit: 569e075

Run: 27999475012

| Env | 🟨 KNOWN | ✅ pass | 🙈 skip | Time |
| --- | --- | --- | --- | --- |
| 🟨 aws linux | 1 | 216 | 99 | 7:35 |
| 🟨 aws windows | 1 | 218 | 97 | 2:35 |
| 🟨 aws-ucws linux | 1 | 297 | 18 | 3:41 |
| 🟨 aws-ucws windows | 1 | 299 | 16 | 3:24 |
| 🟨 azure linux | 1 | 216 | 98 | 5:34 |
| 🟨 azure windows | 1 | 218 | 96 | 2:30 |
| 🟨 azure-ucws linux | 1 | 299 | 15 | 13:10 |
| 🟨 azure-ucws windows | 1 | 301 | 13 | 3:17 |
| 🟨 gcp linux | 1 | 215 | 100 | 6:27 |
| 🟨 gcp windows | 1 | 217 | 98 | 2:29 |

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🟨 TestAccept | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K |

Top 5 slowest tests (at least 2 minutes):

| duration | env | testname |
| --- | --- | --- |
| 7:01 | azure-ucws linux | TestSQLExecScalar |
| 6:54 | aws linux | TestSecretsPutSecretStringValue |
| 5:53 | gcp linux | TestSecretsPutSecretStringValue |
| 4:52 | azure linux | TestSecretsPutSecretStringValue |
| 4:28 | azure-ucws linux | TestSecretsPutSecretStringValue |

@anton-107 anton-107 left a comment

Contributor

Thanks — the diff is clean and the intent is right. Two requested changes on the provisioning notice, both about the wording.

1. Differentiate the message by accelerator type

Right now GPU_1xA10 and GPU_8xH100 get the identical "upwards of 10 minutes" notice, but their provisioning latencies differ a lot — a single A10 is typically acquired much faster than an 8×H100 node. Telling an A10 user to expect 10+ minutes is misleading, and the 8×H100 case arguably warrants a stronger heads-up (P90 ~30 min).

Suggest keying the message off opts.Accelerator — e.g. a small map[string]string of accelerator → expected-time phrasing, with a generic fallback for anything not in the map. That also keeps it correct as new accelerator types are added.

2. Tighten the wording

"upwards of 10 minutes" is a touch informal and slightly misrepresents the data: with P50 ≈ 10 min it implies 10 min is the floor, when in fact roughly half the time it finishes faster — and the real pain is the ~30 min P90 that drove the 45-min timeout in #5569. Anchoring on a range is more useful to someone staring at a long PENDING state. The trailing ... also reads casual for a one-time sentence (vs. the ongoing spinner text, where it fits).

Suggested wording:

  • GPU_8xH100: Provisioning GPU_8xH100 compute. This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.
  • GPU_1xA10: Provisioning GPU_1xA10 compute. This usually takes a few minutes, longer when capacity is constrained. (adjust to the latency we actually observe)

The matching spinner text can stay short, e.g. Provisioning GPU_8xH100 compute....
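
To make the suggestion concrete, here is a minimal sketch of the map-with-fallback idea using the wording proposed above. The names `provisioningNotices`, `genericNotice`, and `provisioningNotice` are hypothetical, the fallback sentence is a placeholder of my own, and `GPU_FUTURE` is a made-up accelerator used only to exercise the fallback; the real code would key off opts.Accelerator.

```go
package main

import "fmt"

// provisioningNotices maps an accelerator type to its expected-time phrasing.
// The two entries use the wording suggested in this review; the A10 text
// should be adjusted to the latency actually observed.
var provisioningNotices = map[string]string{
	"GPU_8xH100": "This typically takes around 10 minutes and can exceed 30 minutes when capacity is constrained.",
	"GPU_1xA10":  "This usually takes a few minutes, longer when capacity is constrained.",
}

// genericNotice is a placeholder fallback for accelerators not in the map.
const genericNotice = "This can take several minutes depending on capacity."

// provisioningNotice builds the one-time notice for the given accelerator,
// falling back to the generic phrasing for unknown accelerator types.
func provisioningNotice(accelerator string) string {
	detail, ok := provisioningNotices[accelerator]
	if !ok {
		detail = genericNotice
	}
	return fmt.Sprintf("Provisioning %s compute. %s", accelerator, detail)
}

func main() {
	// GPU_FUTURE is not in the map and demonstrates the fallback path.
	for _, accelerator := range []string{"GPU_8xH100", "GPU_1xA10", "GPU_FUTURE"} {
		fmt.Println(provisioningNotice(accelerator))
	}
}
```

With this shape, adding a new accelerator type requires no code change to stay correct: it gets the generic notice until someone adds a tuned entry to the map.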

The provisioning heads-up for GPU accelerators was identical for every type
and said "upwards of 10 minutes", which is misleading: a single GPU_1xA10 is
typically acquired in a few minutes, while a GPU_8xH100 node is ~10 min at P50
and can exceed 30 min at P90.

Key the notice off the accelerator type via a small map with a generic
fallback, and anchor the wording on a range rather than a floor so it stays
useful to someone staring at a long PENDING state.

Co-authored-by: Isaac
@TanishqDatabricks TanishqDatabricks added this pull request to the merge queue Jun 23, 2026
@TanishqDatabricks TanishqDatabricks removed this pull request from the merge queue due to a manual request Jun 23, 2026
@TanishqDatabricks TanishqDatabricks added this pull request to the merge queue Jun 23, 2026
Merged via the queue into main with commit 4418351 Jun 23, 2026
22 checks passed
@TanishqDatabricks TanishqDatabricks deleted the ssh-connect-gpu-startup-ux branch June 23, 2026 16:23

3 participants